Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference

نویسندگان

  • Jon Chamberlain
  • Massimo Poesio
  • Udo Kruschwitz
چکیده

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (κ = .88 − .93) but a more variable baseline crowd agreement (κ = .52 − .96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Phrase Detectives

In this Chapter we discuss Phrase Detectives, a Game-With-A-Purpose (GWAP) for anaphoric annotation that has been one of the first GWAPs for corpus annotation and one of the most successful. We discuss the architecture of the game, evaluate its results in terms of quantity and quality of data, and explain how the data were used to create a publically available corpus that can be used to study a...

متن کامل

The Phrase Detective Multilingual Corpus, Release 0.1

The Phrase Detectives Game-With-A-Purpose for anaphoric annotation has been live since December 2008, collecting over 2.5 million judgments on the anaphoric expressions in texts in two languages (English and Italian) from around 9,000 players. In this paper we summarize our recent work on creating a corpus using these annotations.

متن کامل

Corpus - Based Identi cation of Non - Anaphoric NounPhrasesDavid

Coreference resolution involves nding antecedents for anaphoric discourse entities, such as deenite noun phrases. But many deenite noun phrases are not anaphoric because their meaning can be understood from general world knowledge (e.g., \the White House" or \the news media"). We have developed a corpus-based algorithm for automatically identifying deenite noun phrases that are non-anaphoric, w...

متن کامل

Corpus-Based Identification of Non-Anaphoric Noun Phrases

Coreference resolution involves finding antecedents for anaphoric discourse entities, such as definite noun phrases. But many definite noun phrases are not anaphoric because their meaning can be understood from general world knowledge (e.g., "the White House" or "the news media"). We have developed a corpus-based algorithm for automatically identifying definite noun phrases that are non-anaphor...

متن کامل

Learning Noun Phrase Anaphoricity to Improve Conference Resolution: Issues in Representation and Optimization

Knowledge of the anaphoricity of a noun phrase might be profitably exploited by a coreference system to bypass the resolution of non-anaphoric noun phrases. Perhaps surprisingly, recent attempts to incorporate automatically acquired anaphoricity information into coreference systems, however, have led to the degradation in resolution performance. This paper examines several key issues in computi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016